Latent Dirichlet allocation (LDA) is an important probabilistic generative
model that is widely used in domains such as text mining, information
retrieval, and natural language processing. Posterior inference is the key
problem in determining the quality of an LDA model, but it is usually
non-deterministic polynomial (NP)-hard and often intractable, especially in
the worst case. For individual texts, methods such as variational Bayesian
(VB), collapsed variational Bayesian (CVB), collapsed Gibbs sampling (CGS),
and online maximum a posteriori estimation (OPE) have been proposed to avoid
solving this problem directly, but, except for variants of OPE, they usually
have no guarantee on the convergence rate or the quality of the learned
models. Building on OPE and using the Bernoulli distribution, we design an
algorithm, general online maximum a posteriori estimation using two
stochastic bounds (GOPE2), for solving the posterior inference problem in the
LDA model, which is an NP-hard non-convex optimization problem. Through
theoretical analysis and experimental results on large datasets, we find that
GOPE2 yields an efficient method for learning topic models from big text
collections, especially massive or streaming texts, and that it is more
efficient than previous methods.
1. INTRODUCTION


Topic modeling is one of the most general and powerful techniques in data mining [1]-[3].
Much research on topic modeling has been published recently, with applications in fields
such as medical and linguistic science. Latent Dirichlet allocation (LDA) [4] is a popular method for topic
modeling [5]-[7]; it has found successful applications in text modeling [8], bioinformatics [9], [10],
biology [11], history [12], [13], politics [14]-[16], and psychology [17], to name a few. Recent studies of
the coronavirus disease 2019 (COVID-19) pandemic also use the LDA model for data
analysis. These works show the important role and advantages of topic models in text mining [18]-[20].
The quality of an LDA model depends heavily on the inference method [4]. In recent years, many
posterior inference methods have attracted attention, such as variational Bayesian (VB) [4],
collapsed variational Bayesian (CVB) [21], collapsed Gibbs sampling (CGS) [12], [22], and online maximum a
posteriori estimation (OPE) [23]. These methods make it easy to work with big data [12], [24]. However, except
for variants of OPE, most of them have no theoretical guarantee on the convergence rate or model quality.
In topic models, the posterior inference problem is in fact a non-convex optimization
problem, and it belongs to the class of non-deterministic polynomial (NP)-hard problems [25].
OPE has a convergence rate of O(1/T), where T is the number of iterations, which surpasses the best
rate of existing stochastic algorithms for solving non-convex problems [23], [26], [27]. To the best of our
knowledge, solving the posterior inference problem usually leads to a non-convex optimization problem.
The big question is how efficiently an optimization algorithm can escape saddle points. We carefully
consider the optimization algorithms applied to the posterior inference problem, and this is the basis on which we
propose the general online maximum a posteriori estimation using two stochastic bounds (GOPE2)
algorithm. GOPE2 is based on a stochastic optimization approach to
solving the posterior inference problem. Using the Bernoulli distribution and two stochastic bounds of the
true non-convex objective function, we show that GOPE2 performs better than previous
algorithms. It keeps the good properties of OPE while outperforming it. Replacing the true objective
function with stochastic bounds reduces the chance of getting stuck at a local stationary point and helps escape
saddle points. This is an effective way to get rid of saddle points, where existing methods are unsuitable,
especially in high-dimensional non-convex optimization. Using GOPE2 as the core inference
algorithm, we obtain online-GOPE2, an efficient method for learning LDA from large text
collections, especially short-text documents. Our experiments on large datasets show that the
method reaches state-of-the-art performance in both the quality of the learned model and predictiveness.


2. RELATED WORK 
LDA [4] is a generative model for discrete data and for modeling text. In LDA, a corpus is composed
from K topics $\beta = (\beta_1, \ldots, \beta_K)$, each of which is a sample from Dirichlet($\eta$), a V-dimensional
Dirichlet distribution. The LDA model assumes that each document d is a mixture of topics and arises from the
following generative process:
a) Draw $\theta_d \mid \alpha \sim \text{Dirichlet}(\alpha)$
b) For the $n$-th word of d:
— Draw topic index $z_{dn} \mid \theta_d \sim \text{Multinomial}(\theta_d)$
— Draw word $w_{dn} \mid z_{dn}, \beta \sim \text{Multinomial}(\beta_{z_{dn}})$
For each document, both $\theta_d$ and $z_d$ are unobserved local variables, with $\theta_d \in \Delta_K$ and $\beta_k \in \Delta_V$ for all k.
Each topic mixture $\theta_d = (\theta_{d1}, \ldots, \theta_{dK})$ represents the contributions of the topics to document d,
while $\beta_{kj}$ shows the contribution of term j to topic k. The LDA model is described in Figure 1.


Figure 1. The graphical model for latent Dirichlet allocation
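
The generative process above can be illustrated with a minimal sketch. The helper names `sample_dirichlet` and `generate_corpus`, the fixed document length, and the use of the Python standard library are assumptions made for illustration only; they are not part of the model description.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw one symmetric Dirichlet sample via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(n_docs, doc_len, K, V, alpha, eta, seed=0):
    """Sample a toy corpus following the LDA generative process."""
    rng = random.Random(seed)
    # Topics: each beta_k is a V-dimensional draw from Dirichlet(eta).
    beta = [sample_dirichlet(rng, eta, V) for _ in range(K)]
    thetas, docs = [], []
    for _ in range(n_docs):
        # a) Draw the topic mixture theta_d ~ Dirichlet(alpha).
        theta = sample_dirichlet(rng, alpha, K)
        words = []
        for _ in range(doc_len):
            # b) Draw a topic index, then a word from that topic.
            z = rng.choices(range(K), weights=theta)[0]
            w = rng.choices(range(V), weights=beta[z])[0]
            words.append(w)
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs
```

Inference runs in the opposite direction: only the words of each document are observed, and the topic mixture that produced them has to be recovered.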


According to Teh et al. [28], given a corpus $C = \{d_1, \ldots, d_M\}$, Bayesian inference (or learning) is
to estimate the posterior distribution $P(z, \theta, \beta \mid C, \alpha, \eta)$ over the latent topic indices $z = \{z_1, \ldots, z_M\}$, topic
mixtures $\theta = \{\theta_1, \ldots, \theta_M\}$, and topics $\beta = (\beta_1, \ldots, \beta_K)$. Given a model $\{\beta, \alpha\}$, the problem of posterior
inference for each document d is to estimate the full joint distribution $P(z_d, \theta_d, d \mid \beta, \alpha)$. Many
studies show that direct estimation of this distribution is intractable. Existing methods are usually
sampling-based or optimization-based approaches, such as VB, CVB, and CGS. VB and CVB estimate the
distribution by maximizing a lower bound of the likelihood $P(d \mid \beta, \alpha)$, while CGS estimates $P(z_d \mid d, \beta, \alpha)$.


We consider the maximum a posteriori (MAP) estimation of the topic mixture for a given document d:

$$\theta^* = \arg\max_{\theta \in \Delta_K} P(\theta, d \mid \beta, \alpha) \tag{1}$$

Problem (1) is equivalent to (2):

$$\theta^* = \arg\max_{\theta \in \Delta_K} \Big( \sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + (\alpha - 1) \sum_{k=1}^{K} \log \theta_k \Big) \tag{2}$$

The objective

$$f(\theta) = \sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + (\alpha - 1) \sum_{k=1}^{K} \log \theta_k$$

is non-concave when the hyper-parameter $\alpha < 1$, so (2) is a non-concave optimization problem. Denote:

$$g(\theta) := \sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}, \qquad h(\theta) := (1 - \alpha) \sum_{k=1}^{K} \log \theta_k \tag{3}$$

Both $g(\theta)$ and $h(\theta)$ are concave, so $f(\theta) = g(\theta) - h(\theta)$ is a difference of concave (DC) functions.
Problem (2) can therefore be formulated as a DC optimization problem:

$$\theta^* = \arg\max_{\theta \in \Delta_K} \big[ g(\theta) - h(\theta) \big] \tag{4}$$
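
As a hedged illustration of the decomposition in (3) and (4), the following sketch evaluates g, h, and f for one document. Representing the document d as a list of word counts and the topics as a list of K rows are assumptions made for illustration, as are the function names.

```python
import math


def g(theta, d, beta):
    """Likelihood term: sum_j d_j * log(sum_k theta_k * beta[k][j])."""
    total = 0.0
    for j, count in enumerate(d):
        if count > 0:
            mix = sum(t * row[j] for t, row in zip(theta, beta))
            total += count * math.log(mix)
    return total


def h(theta, alpha):
    """Sign-flipped prior term: (1 - alpha) * sum_k log(theta_k)."""
    return (1.0 - alpha) * sum(math.log(t) for t in theta)


def f(theta, d, beta, alpha):
    """MAP objective of problem (2), written in the DC form g - h."""
    return g(theta, d, beta) - h(theta, alpha)
```

For any interior point of the simplex and $\alpha < 1$, both `g` and `h` return negative values, and `f` agrees with the direct formula in (2).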


Non-convex optimization has been an active research area. Popular techniques for DC optimization, such as
branch and bound, cutting-plane algorithms, and the DC algorithm (DCA) [29], are not suitable for
the posterior inference problem (2) in probabilistic topic models. CGS, CVB, and VB are widely used
inference methods for probabilistic topic models, but we have not seen any theoretical analysis of how fast
they perform inference for individual documents. Other candidates include the concave-convex procedure
(CCCP) [30], online Frank-Wolfe (OFW) [31], and stochastic majorization-minimization (SMM) [32]. However,
their convergence speed and the quality of the resulting model are not guaranteed. In practice, posterior
inference in topic models is usually non-convex. By applying the online Frank-Wolfe method for convex
problems in [31], a new algorithm for MAP inference in LDA, namely OFW, has been proposed; it uses a
stochastic sequence combined with the uniform distribution, and its convergence rate is shown to be
$O(1/\sqrt{t})$. Many experiments on large datasets show that OFW is a good approach to the MAP problem and
is usually better than previous methods such as CGS, CVB, and VB. By changing the learning rate and
carefully considering the theoretical aspects, the OPE algorithm has been proposed. OPE approximates the
true objective function $f(\theta)$ by a stochastic sequence $F_t(\theta)$ built from the uniform distribution,
which makes (2) easy to solve. OPE is better than previous methods, but a better algorithm based on
stochastic optimization can still be found for (2). Having identified the limitations of OPE, we improve it
and obtain the GOPE2 algorithm for problem (2). Details of GOPE2 are presented in section 3.


3. PROPOSED METHOD 

OPE is a stochastic algorithm that performs better than others for posterior inference, and it is also
simple and easy to apply, so we improve OPE by randomization to obtain a better variant. The Bernoulli
distribution is a discrete probability distribution of a random variable with two possible
outcomes, which occur with probabilities $p$ and $1 - p$; the special case $p = \frac{1}{2}$ is the uniform
distribution. We use the Bernoulli distribution to construct approximation functions that are easy to
maximize and that approximate the objective function $f(\theta)$ well. We consider the problem:

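$$\theta^* = \arg\max_{\theta \in \Delta_K} \sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + (\alpha - 1) \sum_{k=1}^{K} \log \theta_k \tag{5}$$

We see that:

$$g_1(\theta) = \sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} < 0, \qquad g_2(\theta) = (\alpha - 1) \sum_{k=1}^{K} \log \theta_k > 0,$$

and hence $g_1(\theta) < f(\theta) < g_2(\theta)$.

Pick $f_h$ as a Bernoulli random sample from $\{g_1(\theta), g_2(\theta)\}$, where
$P(f_h = g_1) = p$ and $P(f_h = g_2) = 1 - p$, and form the approximation $F_t(\theta) = \frac{1}{t} \sum_{h=1}^{t} f_h$.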


The stochastic approximation $F_t(\theta)$ is easier to maximize and to differentiate than $f(\theta)$. Recall that
$g_1(\theta) < 0$ and $g_2(\theta) > 0$. Hence, if we choose $f_1 := g_1$, then $F_1(\theta) < f(\theta)$, so $F_t(\theta)$ is a lower bound
of $f(\theta)$. In contrast, if we choose $f_1 := g_2$, then $F_1(\theta) > f(\theta)$, and $F_t(\theta)$ is an upper bound of $f(\theta)$. Using
two stochastic approximations, one from above and one from below $f(\theta)$, is better than using one, and we expect
this to give the new algorithm a faster convergence rate. Let $\{L_t\}$ be an approximating sequence of $f(\theta)$ that
begins with $g_1(\theta)$, and let $\{U_t\}$ be another that begins with $g_2(\theta)$. We set $f_1^l := g_1(\theta)$. For $t = 2, 3, \ldots$,
pick $f_t^l$ as a Bernoulli random sample with probability $p$ from $\{g_1(\theta), g_2(\theta)\}$, where
$P(f_t^l = g_1(\theta)) = p$ and $P(f_t^l = g_2(\theta)) = 1 - p$.
We have $L_t := \frac{1}{t} \sum_{h=1}^{t} f_h^l$, $\forall t = 1, 2, \ldots$ The sequence $\{L_t\}$ is a lower bound of the true objective $f(\theta)$.
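
Since every sampled function is either $g_1$ or $g_2$, an average such as $L_t$ is fully described by how many times $g_1$ has been picked. A minimal sketch of this bookkeeping follows; the helper name `bernoulli_weights` and its arguments are hypothetical and serve only as an illustration.

```python
import random


def bernoulli_weights(p, T, first, seed=0):
    """Return (w1, w2) such that the average after T picks is w1*g1 + w2*g2.

    `first` fixes the initial pick: 1 starts from g1 (lower sequence),
    2 starts from g2 (upper sequence). Later picks choose g1 with
    probability p and g2 with probability 1 - p.
    """
    rng = random.Random(seed)
    n1 = 1 if first == 1 else 0      # number of times g1 has been picked
    for _ in range(2, T + 1):
        if rng.random() < p:
            n1 += 1
    return n1 / T, (T - n1) / T
```

As the number of picks grows, the weight of $g_1$ approaches $p$, so $p$ controls the balance between the likelihood term $g_1$ and the prior term $g_2$ in the approximation.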
By using an iterative approach, from the random sequence $\{L_t\}$ we obtain a numerical sequence
$\{\theta_t^l\}$ $(t = 1, 2, \ldots)$ as:

$$e_t^l := \arg\max_{x \in \Delta_K} \langle L_t'(\theta_t), x \rangle, \qquad \theta_{t+1}^l := \theta_t + \frac{e_t^l - \theta_t}{t} \tag{6}$$
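
The update in (6) has the form of a Frank-Wolfe step: a linear function over the simplex attains its maximum at a vertex, so the arg max reduces to locating the largest gradient coordinate. The sketch below illustrates one such step; `frank_wolfe_step` is a hypothetical helper, and the gradient is assumed to be supplied by the caller.

```python
def frank_wolfe_step(theta, grad, t):
    """One update theta + (e - theta) / t, where e is a simplex vertex."""
    # The maximizer of <grad, x> over the simplex is the unit vector
    # at the largest gradient coordinate.
    i_star = max(range(len(grad)), key=lambda k: grad[k])
    return [
        th + ((1.0 if k == i_star else 0.0) - th) / t
        for k, th in enumerate(theta)
    ]
```

For any $t \ge 2$ the new point is a convex combination of the old point and a vertex, so it stays strictly inside the simplex whenever the old point does.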
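Similarly, we set $f_1^u := g_2(\theta)$. Pick $f_t^u$ as a Bernoulli random sample with probability $p$ from
$\{g_1(\theta), g_2(\theta)\}$, where $P(f_t^u = g_1(\theta)) = p$, $P(f_t^u = g_2(\theta)) = 1 - p$, $\forall t = 2, 3, \ldots$; we have:

$$U_t := \frac{1}{t} \sum_{h=1}^{t} f_h^u, \quad \forall t = 1, 2, \ldots$$

The sequence $\{U_t\}$ is an upper bound of the true objective $f(\theta)$. From the sequence $\{U_t\}$, we also obtain the
numerical sequence $\{\theta_t^u\}$ $(t = 1, 2, \ldots)$ as:

$$e_t^u := \arg\max_{x \in \Delta_K} \langle U_t'(\theta_t), x \rangle, \qquad \theta_{t+1}^u := \theta_t + \frac{e_t^u - \theta_t}{t} \tag{7}$$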
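We combine the two approximating sequences $\{U_t\}$ and $\{L_t\}$. Based on the two sequences $\{\theta_t^l\}$ and $\{\theta_t^u\}$,
we construct an approximate solution sequence $\{\theta_t\}$ as:

$$\theta_t := \theta_t^u \text{ with probability } \frac{\exp f(\theta_t^u)}{\exp f(\theta_t^u) + \exp f(\theta_t^l)} \quad \text{and} \quad \theta_t := \theta_t^l \text{ with probability } \frac{\exp f(\theta_t^l)}{\exp f(\theta_t^u) + \exp f(\theta_t^l)} \tag{8}$$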
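
Assuming the selection probability in (8) is proportional to the exponential of the objective value, it can be computed as a logistic function of the difference of the two objective values, which avoids overflow when the values are large in magnitude. The helper name `pick_probability` is hypothetical.

```python
import math


def pick_probability(f_upper, f_lower):
    """Probability of keeping theta^u: exp(f_u) / (exp(f_u) + exp(f_l))."""
    # Equivalent form 1 / (1 + exp(f_l - f_u)), evaluated so that the
    # argument of exp is never positive.
    diff = f_lower - f_upper
    if diff > 0:
        return math.exp(-diff) / (1.0 + math.exp(-diff))
    return 1.0 / (1.0 + math.exp(diff))
```

When the two candidates have the same objective value the probability is 1/2, and it moves toward 1 as the candidate from the upper sequence scores higher.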


Then, we obtain the GOPE2 algorithm. The effectiveness of GOPE2 depends on choosing $\theta_t$ differently at
each iteration. Details of GOPE2 are presented in Algorithm 1.


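Algorithm 1. GOPE2 algorithm for the posterior inference

Input: document d, Bernoulli parameter $p \in (0, 1)$, and model $\{\beta, \alpha\}$
Output: $\theta$ that maximizes $f(\theta) = \sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj} + (\alpha - 1) \sum_{k=1}^{K} \log \theta_k$
Initialize $\theta_1$ arbitrarily in $\Delta_K$
$g_1(\theta) := \sum_j d_j \log \sum_{k=1}^{K} \theta_k \beta_{kj}$; $g_2(\theta) := (\alpha - 1) \sum_{k=1}^{K} \log \theta_k$
$f_1^l := g_1(\theta)$; $f_1^u := g_2(\theta)$
for t = 2, 3, ..., T
    Pick $f_t^u$ randomly from $\{g_1(\theta), g_2(\theta)\}$ according to the Bernoulli distribution where
        $P(f_t^u = g_1(\theta)) = p$, $P(f_t^u = g_2(\theta)) = 1 - p$
    $U_t := \frac{1}{t} \sum_{h=1}^{t} f_h^u$
    $e_t^u := \arg\max_{x \in \Delta_K} \langle U_t'(\theta_{t-1}), x \rangle$
    $\theta_t^u := \theta_{t-1} + \frac{e_t^u - \theta_{t-1}}{t}$
    Pick $f_t^l$ randomly from $\{g_1(\theta), g_2(\theta)\}$ according to the Bernoulli distribution where
        $P(f_t^l = g_1(\theta)) = p$, $P(f_t^l = g_2(\theta)) = 1 - p$
    $L_t := \frac{1}{t} \sum_{h=1}^{t} f_h^l$
    $e_t^l := \arg\max_{x \in \Delta_K} \langle L_t'(\theta_{t-1}), x \rangle$
    $\theta_t^l := \theta_{t-1} + \frac{e_t^l - \theta_{t-1}}{t}$
    $\theta_t := \theta_t^u$ with probability $q$ and $\theta_t := \theta_t^l$ with probability $1 - q$, where $q = \frac{\exp f(\theta_t^u)}{\exp f(\theta_t^u) + \exp f(\theta_t^l)}$
End for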
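
The loop of Algorithm 1 can be sketched as follows. This is an illustrative reading rather than a reference implementation: it assumes that both bound steps start from the combined iterate of the previous iteration, that the selection probability is proportional to the exponential of the objective value, that the document is a list of word counts, and that every word of the document has positive probability under at least one topic. The function name `gope2` and the uniform initialization are likewise assumptions.

```python
import math
import random


def gope2(d, beta, alpha, p, T=50, seed=0):
    """Sketch of GOPE2 for one document.

    d: word counts (length V); beta: K topics, each a length-V distribution;
    alpha: Dirichlet hyper-parameter (< 1); p: Bernoulli parameter in (0, 1).
    """
    rng = random.Random(seed)
    K = len(beta)
    words = [j for j, c in enumerate(d) if c > 0]

    def mixture(theta, j):
        return sum(theta[k] * beta[k][j] for k in range(K))

    def f(theta):
        lik = sum(d[j] * math.log(mixture(theta, j)) for j in words)
        return lik + (alpha - 1.0) * sum(math.log(x) for x in theta)

    def grad(theta, n1, n2, t):
        # Gradient of (n1 * g1 + n2 * g2) / t at theta.
        mix = {j: mixture(theta, j) for j in words}
        return [
            (n1 * sum(d[j] * beta[k][j] / mix[j] for j in words)
             + n2 * (alpha - 1.0) / theta[k]) / t
            for k in range(K)
        ]

    def fw_step(theta, direction, t):
        i = max(range(K), key=lambda k: direction[k])
        return [x + ((1.0 if k == i else 0.0) - x) / t
                for k, x in enumerate(theta)]

    theta = [1.0 / K] * K            # theta_1: interior point of the simplex
    up_g1, low_g1 = 0, 1             # f_1^u = g2 and f_1^l = g1
    for t in range(2, T + 1):
        up_g1 += int(rng.random() < p)    # Bernoulli pick, upper sequence
        low_g1 += int(rng.random() < p)   # Bernoulli pick, lower sequence
        theta_u = fw_step(theta, grad(theta, up_g1, t - up_g1, t), t)
        theta_l = fw_step(theta, grad(theta, low_g1, t - low_g1, t), t)
        # q = exp f(theta_u) / (exp f(theta_u) + exp f(theta_l)); the
        # exponent is capped to keep math.exp within floating-point range.
        q = 1.0 / (1.0 + math.exp(min(f(theta_l) - f(theta_u), 700.0)))
        theta = theta_u if rng.random() < q else theta_l
    return theta
```

Tracking only the number of times $g_1$ has been picked is enough to form the gradients of $U_t$ and $L_t$, so each iteration costs two gradient evaluations and two objective evaluations.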


Interweaving two bounds of the objective function and combining them with the Bernoulli distribution makes
GOPE2 behave very differently from OPE. GOPE2 creates three numerical sequences $\{\theta_t\}$, $\{\theta_t^l\}$, and $\{\theta_t^u\}$,
where $\{\theta_t\}$ depends on $\{\theta_t^u\}$ and $\{\theta_t^l\}$ at each iteration. The structure of the sequence $\{\theta_t\}$ changes
substantially, but the good properties of OPE are retained. GOPE2 has many nice properties that other
algorithms do not have. Starting from online-OPE [23] for learning LDA and replacing OPE with GOPE2, we design the
online-GOPE2 algorithm to learn LDA from large corpora. This algorithm employs GOPE2 to perform maximum
a posteriori estimation for individual documents in order to infer global variables such as topics.
4. EXPERIMENTAL RESULTS
In this section, we investigate the behavior of GOPE2 and show how useful it is as a fast inference
method for designing a new algorithm for large-scale learning of topic models.
We compare GOPE2 with other inference methods such as CGS, CVB, VB, and OPE. These
inference methods are used to construct methods for learning LDA, namely online-CGS [12], online-CVB [21],
online-VB [24], and online-OPE. We evaluate the GOPE2 algorithm indirectly via the efficiency of online-GOPE2.
— Datasets
In our experiments we use two large long-text datasets: PubMed and New York Times.
We also use three large short-text datasets: Tweets from Twitter; NYT-titles from the New
York Times, where each document is the title of an article; and Yahoo questions crawled from
answers.yahoo.com. Details of these datasets are presented in Table 1.
— Parameter settings
We set K = 100 as the number of topics, and set the hyper-parameter $\alpha$ and the topic Dirichlet
parameter $\eta$ to values commonly used in topic models. We also choose T = 50 as the number of iterations. We set
$\kappa = 0.9$ and $\tau = 1$, which are best suited to the inference methods.
— Performance measures
Log predictive probability (LPP) [12] measures the predictability and generalization of a model on new data.
Normalized pointwise mutual information (NPMI) [33] evaluates the semantic quality of an individual topic.
In extensive experiments, NPMI agrees well with human evaluation of the interpretability of topic
models. In this paper, we use LPP and NPMI to evaluate the learning methods. With the Bernoulli
parameter $p \in \{0.30, 0.35, \ldots, 0.70\}$ and mini-batch size $|C_t|$ = 25,000 on the two long-text datasets, our
experimental results are presented in Figure 2.
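
To make the NPMI measure concrete, the sketch below computes one common formulation: the average, over all pairs of a topic's top words, of the pointwise mutual information normalized by the negative log of the joint probability, with probabilities estimated from document co-occurrence. The exact variant and scaling used for the numbers reported in this paper are not specified here, so the function `topic_npmi` and its conventions for the boundary cases are assumptions.

```python
import math
from itertools import combinations


def topic_npmi(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    documents: iterable of collections of word ids, used to estimate
    occurrence and co-occurrence probabilities.
    """
    docs = [set(doc) for doc in documents]
    n = len(docs)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in doc for w in words) for doc in docs) / n

    scores = []
    for wi, wj in combinations(top_words, 2):
        joint = prob(wi, wj)
        if joint == 0.0:
            scores.append(-1.0)   # never co-occur: minimum NPMI
        elif joint == 1.0:
            scores.append(1.0)    # always co-occur: limiting value (convention)
        else:
            pmi = math.log(joint / (prob(wi) * prob(wj)))
            scores.append(pmi / -math.log(joint))
    return sum(scores) / len(scores)
```

Each pairwise score lies in [-1, 1]; higher values indicate that the top words of a topic tend to appear together, which is the sense in which NPMI reflects topic coherence.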

Figure 2 shows that the effectiveness of online-GOPE2 depends on the selected
probability $p$ and on the dataset. For a given measure, the results on the New York
Times dataset do not differ much from those on the PubMed dataset; for a given dataset, the
results vary more on NPMI than on LPP. Our method is usually better than the others
when $p \approx 0.7$. We then divide the data into smaller mini-batches, $|C_t|$ = 5,000, and use a wider range of
the Bernoulli parameter, $p \in \{0.1, 0.2, \ldots, 0.9\}$. The results of online-GOPE2 on the two long-text
datasets are presented in Figure 3.

Figure 3 shows that the results of online-GOPE2 depend strongly on the choice of the Bernoulli
parameter $p$ and the mini-batch size $|C_t|$. A mini-batch size of 5,000 gives better results than 25,000:
when the mini-batch size decreases, the values of the measures
increase, so the model is learned better. Next, we compare online-GOPE2 with other learning methods, namely
online-CGS, online-CVB, online-VB, and online-OPE. The experimental results on the long-text datasets PubMed and
New York Times are shown in Figure 4: online-GOPE2 is better than online-CGS, online-CVB,
online-VB, and online-OPE on both datasets in terms of LPP and NPMI.


Table 1. Five datasets in our experiments

Datasets          Corpus size   Average length per doc   Vocabulary size
PubMed                330,000                    65.12           141,044
New York Times        300,000                   325.13           102,661
Twitter tweets      1,457,687                    10.14            89,474
NYT-titles          1,664,127                     5.15            55,488
Yahoo questions       517,770                     4.73            24,420
Figure 2. Predictiveness (LPP) and semantic quality (NPMI) of the models learned by online-GOPE2
with mini-batch size $|C_t|$ = 25,000 on long-text datasets


Figure 3. LPP and NPMI of the models learned by online-GOPE2 with mini-batch size $|C_t|$ = 5,000 on
long-text datasets
LDA usually does not perform well on short texts. We provide additional evidence of the
effectiveness of GOPE2 by investigating the quality of the models learned from short texts. We run
experiments on three short-text datasets: Yahoo, Twitter, and NYT-titles. The experimental results of
online-GOPE2 on these datasets are presented in Figure 5 and Figure 6.


Figure 4. Performance of different learning methods (online-OPE, online-VB, online-CVB, online-CGS, and
online-GOPE2) on long-text datasets, plotted against the number of documents seen (x5,000). Online-GOPE2
often surpasses all other methods


Figure 6 and Table 2 show that GOPE2 usually gives better results on short-text datasets when the
Bernoulli parameter $p$ is small. Figure 6 also shows that the model overfits when learned by the
VB and CVB methods: the LPP and NPMI measures of the models learned by online-VB and
online-CVB decrease on the three short-text datasets. This does not happen for GOPE2 and its
variants. Across experiments with different mini-batch sizes and datasets, our improvements
usually give better results than previous methods. GOPE2 gives better results than other methods for the
following reasons.

— The Bernoulli distribution is more general than the uniform distribution. The Bernoulli parameter $p$ plays
the role of a regularization parameter, which helps the model avoid overfitting. This explains the contribution
of the prior and the likelihood to solving the inference problem.

— The squeeze theorem is applied when constructing the lower bound $\{L_t\}$ and the upper bound $\{U_t\}$ of the true
objective function $f(\theta)$.


Table 2. Experimental results of some learning methods on short-text datasets

Datasets     Measures   Online-GOPE2   Online-OPE   Online-VB   Online-CVB   Online-CGS
NYT-titles   LPP             -8.4635      -8.6031     -9.6374      -9.5583      -8.4963
Twitter      LPP             -6.4297      -6.6943     -7.6152      -7.1264      -6.8151
Yahoo        LPP             -7.7222      -7.8505     -8.9342      -8.8417      -7.8501
NYT-titles   NPMI             4.1256       3.6037      0.6381       1.0348       4.6319
Twitter      NPMI             9.8042       9.3677      5.4541       6.0338       8.0621
Yahoo        NPMI             5.0205       4.4785      1.4634       1.9191       5.0181
Figure 5. LPP and NPMI of the models learned by online-GOPE2 with Bernoulli parameter $p$ and
$|C_t|$ = 5,000 on short-text datasets


Figure 6. Performance of learning methods on short-text datasets (NYT-titles, Twitter, and Yahoo).
Online-GOPE2 often surpasses other methods
5. CONCLUSION
Posterior inference for individual texts is very important in topic models, as it directly determines
the quality of the learned models. In this paper, we have proposed GOPE2, a stochastic optimization
algorithm that solves the posterior inference problem well by using the Bernoulli distribution and two stochastic
approximations. In addition, the Bernoulli parameter $p$ acts as a regularization parameter that helps the
model be more efficient and avoid overfitting. Using GOPE2, we obtain online-GOPE2, an efficient method
for learning LDA from data streams or large corpora. The experimental results show that GOPE2 is usually
better than the compared methods such as CGS, CVB, and VB. Thus, online-GOPE2 is a good candidate for
dealing with big data.